⚡️ Speed up function _sample by 21%
#45
Open
+9
−2
📄 21% (0.21x) speedup for _sample in datacompy/fugue.py
⏱️ Runtime: 10.3 milliseconds → 8.57 milliseconds (best of 201 runs)
📝 Explanation and details
The optimization achieves a 20% speedup through three key improvements:
What was optimized:
len(df) computation - Stored in a df_len variable to avoid repeated calls.
reset_index() calls - Added a check to return the DataFrame directly when it already has a default RangeIndex (start=0, step=1).
ignore_index=True in df.sample() - Replaced the separate reset_index(drop=True) call with pandas' built-in index resetting.
Why it's faster:
ignore_index=True in sample() is more efficient than calling sample() followed by reset_index(drop=True).
Performance impact based on test results:
The optimization shows dramatic improvements when returning all rows (1000%+ faster in many cases) due to the zero-copy path, and modest but consistent 8-13% improvements when actually sampling. This makes sense given that the _sample function is called in data comparison reporting workflows where it processes both unique row samples and mismatch samples, operations that frequently return the entire DataFrame when sample counts are large relative to data size.
Hot path relevance:
Based on the function references, _sample is called multiple times during report generation (_get_compare_result, _aggregate_stats) and appears to be in performance-critical data comparison workflows where these micro-optimizations compound across multiple sampling operations.
✅ Correctness verification report:
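As a rough illustration of what the correctness check needs to establish, the sketch below pairs a baseline with an optimized variant built from the three changes described above and compares their outputs. It is not the code from datacompy/fugue.py: the names sample_sketch and baseline_sketch, the signature, and the random_state=0 seed are assumptions made for the example. The ignore_index parameter of DataFrame.sample requires pandas 1.3 or newer.

```python
import pandas as pd


def baseline_sketch(df: pd.DataFrame, sample_count: int) -> pd.DataFrame:
    """Hypothetical pre-optimization shape: always reset the index separately."""
    if len(df) <= sample_count:
        return df.reset_index(drop=True)
    return df.sample(n=sample_count, random_state=0).reset_index(drop=True)


def sample_sketch(df: pd.DataFrame, sample_count: int) -> pd.DataFrame:
    """Hypothetical optimized shape; a stand-in, not the actual _sample."""
    df_len = len(df)  # computed once and reused
    if df_len <= sample_count:
        idx = df.index
        # Fast path: a default RangeIndex (start=0, step=1) needs no reset,
        # so the input frame is returned as-is without copying.
        if isinstance(idx, pd.RangeIndex) and idx.start == 0 and idx.step == 1:
            return df
        return df.reset_index(drop=True)
    # ignore_index=True relabels the sampled rows 0..n-1 in the same call.
    return df.sample(n=sample_count, random_state=0, ignore_index=True)


if __name__ == "__main__":
    default_idx = pd.DataFrame({"a": range(10), "b": list("abcdefghij")})
    shifted_idx = default_idx.set_index(pd.Index(range(100, 110)))
    for frame in (default_idx, shifted_idx):
        for n in (3, 10, 50):
            pd.testing.assert_frame_equal(
                sample_sketch(frame, n), baseline_sketch(frame, n)
            )
    print("optimized and baseline sketches agree")
```

One design consequence of the fast path in this sketch: when the input already has a default RangeIndex, the caller gets back the same object rather than a new frame, so any later in-place mutation of the result would also affect the input. Whether that matters depends on how callers use the returned sample.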
🌀 Generated Regression Tests and Runtime
⏪ Replay Tests and Runtime
test_pytest_teststest_sparktest_helper_py_teststest_fuguetest_fugue_polars_py_teststest_fuguetest_fugue_p__replay_test_0.py::test_datacompy_fugue__sample
To edit these changes, git checkout codeflash/optimize-_sample-mi6jonhm and push.